========================================================
The Wine Quality datasets are publicly available and were created from red and white wine samples. The two datasets are related to red and white variants of the Portuguese “Vinho Verde” wine. For my exploratory data analysis, I am considering the red wine dataset.
The objective here is to gain an initial understanding of the data.
Structure of the data: study the data types, dimensions of the data, and sample values:
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
Are there any NA values in the data frame?
##
## FALSE
## 20787
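The single FALSE count can be sanity-checked: 1599 observations times 13 columns gives 20787 cells, so every cell is non-missing. Below is a minimal Python sketch of the same kind of tally (the report itself was produced in R). The helper name `count_missing` is hypothetical, and the two sample rows are the first two observations from the structure listing, restricted to three columns.

```python
import math
from collections import Counter

def count_missing(rows):
    """Tally missing (None or NaN) versus present cells across all rows."""
    tally = Counter()
    for row in rows:
        for value in row.values():
            is_missing = value is None or (
                isinstance(value, float) and math.isnan(value)
            )
            tally[is_missing] += 1
    return tally

# First two observations from the structure listing (subset of columns).
sample = [
    {"fixed.acidity": 7.4, "volatile.acidity": 0.70, "quality": 5},
    {"fixed.acidity": 7.8, "volatile.acidity": 0.88, "quality": 5},
]
tally = count_missing(sample)

# Rows x columns reported by the structure listing.
total_cells = 1599 * 13
print(tally[False], tally[True], total_cells)
```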
Summary of the data:
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
The above chart helps to visually explore the quality of wine in the sample set. The majority of data points have a quality of 5 or 6. More importantly, the sample data has wine quality ranging only from 3 to 8. The absence of data for the highest and lowest quality scores could potentially be vital, and it has to be considered when drawing any final conclusion on the relationships between the variables.
Quality of wine samples, tabulated:
##
## 3 4 5 6 7 8
## 10 53 681 638 199 18
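To put the table in proportion, the counts can be turned into shares of the 1599 samples; qualities 5 and 6 together account for roughly 82% of them. A short Python sketch using only the counts printed above (the report itself was produced in R):

```python
# Counts copied from the quality table above.
quality_counts = {3: 10, 4: 53, 5: 681, 6: 638, 7: 199, 8: 18}

total = sum(quality_counts.values())
shares = {q: n / total for q, n in quality_counts.items()}
mid_share = shares[5] + shares[6]

for q, share in shares.items():
    print(f"quality {q}: {share:.1%}")
print(f"quality 5 or 6: {mid_share:.1%}")
```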
The above tabulated values give us a preliminary understanding of the distribution of the data. Now, I would like to see whether there are any relationships between the variables that might be of interest in forming an initial hypothesis.
With a fair understanding of the output variable, i.e. the quality of the wine, I will now attempt to understand the rest of the variables that might potentially contribute to wine quality.
## # A tibble: 12 x 4
## key mean_value median_value mode_value
## <chr> <dbl> <dbl> <dbl>
## 1 alcohol 10.4 10.2 9.5
## 2 chlorides 0.0875 0.079 0.08
## 3 citric.acid 0.271 0.26 0
## 4 density 0.997 0.997 0.997
## 5 fixed.acidity 8.32 7.9 7.2
## 6 free.sulfur.dioxide 15.9 14 6
## 7 pH 3.31 3.31 3.3
## 8 quality 5.64 6 5
## 9 residual.sugar 2.54 2.2 2
## 10 sulphates 0.658 0.62 0.6
## 11 total.sulfur.dioxide 46.5 38 28
## 12 volatile.acidity 0.528 0.52 0.6
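As an illustration of what the three columns measure, here is a minimal Python sketch (the report itself was produced in R) that computes mean, median and mode for the first ten alcohol values from the structure listing. Because it uses only ten values, its results will not match the full-column figures in the table.

```python
import statistics

# First ten alcohol values from the structure listing (not the full column).
alcohol = [9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5]

summary = {
    "mean_value": statistics.mean(alcohol),
    "median_value": statistics.median(alcohol),
    "mode_value": statistics.mode(alcohol),  # most frequent value
}
print(summary)
```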
Count of observations with citric.acid equal to 0:
## [1] 132
## [1] "Quality vs Grade - Counts of Sample"
##
## low medium high
## 3 10 0 0
## 4 53 0 0
## 5 0 681 0
## 6 0 638 0
## 7 0 0 199
## 8 0 0 18
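The cross-tab implies a simple binning rule: qualities 3 and 4 are "low", 5 and 6 are "medium", and 7 and 8 are "high". A minimal Python sketch of that rule (the report itself was produced in R; the function name `grade` is hypothetical, and the cut points are read off the table rather than taken from the original code):

```python
from collections import Counter

def grade(quality):
    """Map a quality score to the grade implied by the cross-tab above."""
    if quality <= 4:
        return "low"
    if quality <= 6:
        return "medium"
    return "high"

# Counts copied from the quality table.
quality_counts = {3: 10, 4: 53, 5: 681, 6: 638, 7: 199, 8: 18}

grade_counts = Counter()
for q, n in quality_counts.items():
    grade_counts[grade(q)] += n
print(dict(grade_counts))
```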
A new factor variable “quality.ordered” is created, with the levels below:
## [1] "3" "4" "5" "6" "7" "8"
The dataset has 1599 observations across 12 variables and 1 key variable. There are no NA values in the data frame. The quality of the red wine is of interest here.
The main feature is the quality of the red wine. The purpose is to study whether the other features have any influence on it.
I am interested in alcohol, chlorides, citric.acid, fixed.acidity and residual.sugar, as the variation in these features leads me to believe that some of it might explain the differences in quality.
Chlorides, fixed.acidity, residual.sugar, free.sulfur.dioxide, sulphates and total.sulfur.dioxide are positively skewed, with long right tails.
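One rough way to spot positive skew from summary statistics alone is to check whether the mean sits above the median. This is only a heuristic, not a formal skewness test, and it is my illustration rather than the method used in the report. A short Python sketch using the mean and median values from the per-feature table:

```python
# (mean, median) pairs copied from the per-feature summary table.
stats = {
    "chlorides": (0.0875, 0.079),
    "fixed.acidity": (8.32, 7.9),
    "residual.sugar": (2.54, 2.2),
    "free.sulfur.dioxide": (15.9, 14),
    "sulphates": (0.658, 0.62),
    "total.sulfur.dioxide": (46.5, 38),
    "density": (0.997, 0.997),
    "pH": (3.31, 3.31),
}

# Heuristic: mean > median hints at a longer right tail.
right_skew_hint = sorted(
    name for name, (mean, median) in stats.items() if mean > median
)
print(right_skew_hint)
```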
I created a data frame called rw_long in long format for easier plotting. I also created a data frame called rw_long_by_keys, grouped by feature, to make it easier to eyeball the mean, median and mode of each feature. Lastly, I created a data frame called rw_corr that holds just the numerical variables, in order to compute the correlation coefficients on them.
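For readers unfamiliar with long format, the sketch below shows the idea in plain Python (the report itself was produced in R): each wide row becomes one record per feature. The helper `to_long`, the `value` column name and the variable `rw_long_sketch` are hypothetical; the two rows are the first two observations from the structure listing, restricted to three features.

```python
def to_long(rows, id_col="X"):
    """Reshape wide rows into one {id, key, value} record per feature."""
    return [
        {id_col: row[id_col], "key": key, "value": value}
        for row in rows
        for key, value in row.items()
        if key != id_col
    ]

# First two observations, three features only.
wide = [
    {"X": 1, "alcohol": 9.4, "pH": 3.51, "sulphates": 0.56},
    {"X": 2, "alcohol": 9.8, "pH": 3.20, "sulphates": 0.68},
]
rw_long_sketch = to_long(wide)
for record in rw_long_sketch:
    print(record)
```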
To begin with, I would like to understand whether there is any relationship among the variables in terms of correlation. I am planning to use ggpairs as a rough cut and refine further using a correlation plot.
The correlation plot above provides some insight into the relationships.
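As a reminder of what the correlation coefficient measures, here is a minimal Python sketch of Pearson's r (the report itself was produced in R). It runs on the first ten alcohol and quality values from the structure listing, so the number it prints says nothing reliable about the full dataset; the function name `pearson` is hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# First ten observations only (illustrative, not the full columns).
alcohol = [9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5]
quality = [5, 5, 5, 6, 5, 5, 5, 7, 7, 5]

r = pearson(alcohol, quality)
print(round(r, 3))
```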
Summary of alcohol:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
From the above plot, as the quality of the wine increases from 5 to 8, the alcohol content seems to increase, as indicated by the median of the sample. It is also prudent to note that within each quality bucket there is variation in the alcohol content.
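The per-quality medians that such a plot displays can be computed by grouping the values by quality score. A minimal Python sketch (the report itself was produced in R), using only the first ten observations from the structure listing, so the medians are illustrative rather than the ones plotted:

```python
import statistics
from collections import defaultdict

# First ten observations only (illustrative, not the full columns).
alcohol = [9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5]
quality = [5, 5, 5, 6, 5, 5, 5, 7, 7, 5]

# Collect alcohol values under each quality score.
by_quality = defaultdict(list)
for q, a in zip(quality, alcohol):
    by_quality[q].append(a)

median_by_quality = {
    q: statistics.median(values) for q, values in sorted(by_quality.items())
}
print(median_by_quality)
```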
Summary of sulphates:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
The above plot reinforces the positive correlation between sulphates and alcohol, and we see higher sulphate content in the better-quality wines. Having said that, the sulphate distribution is long-tailed and, as can be seen above, the mean is greater than the median across the quality bins.
Summary of citric acid:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
Variation in citric acid does seem to be related to variation in the quality of the red wine, as better-quality wines seem to have a higher median citric.acid than lower-quality wines across the spectrum.
Summary of volatile.acidity:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
## [1] "Summary of volatile.acidity with quality 7"
## volatile.acidity
## Min. :0.1200
## 1st Qu.:0.3000
## Median :0.3700
## Mean :0.4039
## 3rd Qu.:0.4850
## Max. :0.9150
## [1] "Summary of volatile.acidity with quality 8"
## volatile.acidity
## Min. :0.2600
## 1st Qu.:0.3350
## Median :0.3700
## Mean :0.4233
## 3rd Qu.:0.4725
## Max. :0.8500
The above chart indicates the negative correlation between volatile.acidity and quality: higher-quality wines have lower volatile.acidity content. Note that the median of volatile.acidity is 0.3700 for red wine samples with a quality rating of both 7 and 8.
It is interesting to note that pH and quality are slightly negatively correlated, giving the impression that a higher acidity level leads to better wine. Is this really the case? Row 3 of the grid gives this information: citric acid and fixed.acidity are positively correlated (even though fixed.acidity is only weakly correlated). However, volatile.acidity (VA) bucks the trend. While one can understand how higher VA can be associated with lower quality, it is intriguing that higher VA does not result in lower pH. A little reading helped here. Apparently VA refers to the acidic elements of a wine that are gaseous rather than liquid, and can therefore be sensed as a smell. (Reference: https://www.decanter.com/learn/volatile-acidity-va-45532/#cbyaQ5UHej7Z1m1D.99)
From the above plots: a) residual.sugar has no strong bearing on the quality of wine; b) density is weakly correlated with quality, and the correlation seems to be negative, as can be seen as we move from quality grade 5 to 8; c) density has a strong positive correlation with fixed.acidity, citric acid and residual sugar. Note: geom_smooth() used method = ‘gam’ and formula ‘y ~ s(x, bs = “cs”)’ for density vs citric acid.
Here I analyze two or more features together in order to dig further and clarify the understanding from the prior sections. It will be interesting to see whether introducing a third variable reveals something interesting or unexpected.
From the correlation plot and bivariate analysis, we understood that sulphates and alcohol had a positive correlation with the quality of the red wine. The above chart supports this understanding: the blue data points are layered above the yellow points, which are layered above the red data points, indicating that sulphate and alcohol content seem to influence the quality of the wine. The bottom chart provides further insight into the sub-trend within each quality grade. There are outliers here, of course.
In the top chart, there are some yellow data points above the blue points, and there is a red data point at around 1 (g / dm^3) of citric acid. This implies that higher citric acid does not necessarily mean better quality, and that there may be other factors that influence the quality. Additionally, the alcohol/citric.acid analysis (bottom chart) indicates that alcohol is dominant in influencing the quality, as explained below.
The clusters as along x-axis as evident from the media
The top chart shows that the variation in citric acid, along with the variation in sulphates, does have some influence on the wine quality between the grades. This can be seen in the clusters of green, light green and pink data points.
The bottom chart is very interesting. Among the low-grade red wines, it appears that citric.acid does play a part. However, within the mid- and high-grade wines, sulphate content takes over and citric.acid has a weak correlation.
The above chart reveals an interesting insight. We know from the bivariate analysis that alcohol has a positive correlation with quality and that volatile.acidity has a negative correlation with quality. When alcohol and volatile.acidity are studied against each other, the trend is as expected for the most part: the high-quality wine data points sit in the lower right quadrant of higher alcohol and lower volatile.acidity.
But looking at the last chart, i.e. VA vs alcohol among high-grade wines, we see several data points with a wine quality of 8 and high volatile.acidity content. This trend was not apparent in the bivariate analysis.
In the top chart, there seem to be clusters along the y-axis where high-quality wines have lower volatile.acidity. This also implies that density is weakly correlated with quality.
The bottom chart gives an interesting sub-trend. Studying the VA vs density variation between wine qualities of 7 and 8, we see data points with higher volatile.acidity belonging to wine quality 8. I was expecting to see them associated with quality 7.
This variation in the trend could be because of one of the following factors:
The above chart was helpful and provided a dashboard view of quality vis-à-vis density, fixed acidity, citric acid and residual sugar, in addition to a view of density vs fixed acidity, citric acid and residual sugar. The takeaway from this is the inference that
This plot clearly indicates the segmentation of the wine quality grades in relation to the two significant properties, i.e. alcohol and sulphates. The high quality grade, indicated by blue dots, is seen in the upper right quadrant, followed by the medium quality grade, indicated by yellow, with the low quality grade at the bottom, indicated by the orange data points. To me, this showed the variation in the input properties, i.e. alcohol and sulphates, and their plausible effect on the quality variable.
The same pattern can be seen in the bottom chart, which breaks the quality down further to the individual scores.
This inference is again based on the given sample, and it is prudent to caution that correlation does not imply causation here.
This chart shed useful light and provided the benefit of both aggregating and drilling down into the data. The top chart provides an aggregated view of alcohol and volatile.acidity by quality grade. It painted a nice picture of how fine-quality wine has lower volatile.acidity content and higher alcohol as a rule of thumb. But on breaking it down to the granular level of the quality index, I was surprised that the highest-quality wine, with a score of 8, had higher volatile.acidity content, breaking my assumption built from the corrplot and univariate analysis.
Being a teetotaller, I approached this analysis with no prior subject knowledge. Perhaps that is a good thing, as my only biased opinion was that residual.sugar must be influencing the wine quality. But it turned out not to be :)
When I started the exploration, the inherent relationships among the variables were not intuitive. The phased structure of the exploration, i.e. univariate followed by bivariate and multivariate, helped in making an initial hypothesis and subsequently either validating or refining the hypothesis about the data.
Studying multiple variables helped scratch the surface of multicollinearity and did throw up some unexpected results, as explained in the above section.
For my future work, I would like to study the following:
Lastly, it would be nice to have a bigger sample size, particularly across all quality bins, to derive further insights.